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CONTEXTUAL CLASSIFICATION OF 
MULT1SPECTRAL IMAGE DATA: 
APPROXIMATE ALGORITHM 


James C. Tilton 


School of Electrical Engineering 
and 

Laboratory for Applications of Remote Sensing 

Purdue University 
West Lafayette 47907, U.S.A. 


ABSTRACT 

Earlier reports[2,3] have introduced a classification algo- 
rithm incorporating spatial context information in a general, sta- 
tistical manner. Here an approximation to that algorithm is 
presented which is computationally less intensive, yet produces 
classifications that are nearly as accurate. 


I. INTRODUCTION 

The most widely used method for classifying remotely sensed data from 
such sources as multispectral scanners on aircraft or satellite platforms is a 
point-by-point classification technique in which data from each pixel in the 
scene are classified individually and and independently[l]. The information nor- 
mally used by this classifier is only spectral or, in some cases, spectral and tem- 
poral. There is generally no provision for using spatial information. 

In contrast, when scanner data are displayed in image form, a human 
analyst routinely uses spatial context to help decide what is in the imagery. 
Using context, he or she may be able to easily pick out roads, delineate boun- 
daries of agricultural fields, and differentiate between grass in an urban setting 
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(lawns) and grass in an agricultural setting (pasture or forage crops) where a 
pointrby-point classifier would have much difficulty in doing so. 

Earlier reports[2,3]’ describe the development of a statistical classification 
algorithm which incorporates' spatial context information in a general manner. 
This algorithm exploits the tendency alluded to above of certain ground-cover 
classes to be more likely to occur in some contexts than in: others. 

i • ■ 

This contextual classification algorithm is very computationally intensive, 
typically requiring a large amount of computer time. To reduce execution time, 
one could exploit the latest improvements in the raw speed of computer com- 
ponents and/or one could take advantage of' special' computer architectures 
involving multiple processing elements[2,4]. An alternative tactic discussed 
here is to look for a less computationally intensive algorithm, which approxi- 
mates the original contextual classification algorithm. If- such an algorithm pro- 
duces classifications that do not differ significantly in accuracy from the original 
algorithm, the approximate algorithm would be the preferred algorithm in prac- 
tical applications using conventional (serial) computers*. 

II. ORIGINAL ALGORITHM 

In order to fully discuss the approximate algorithm, a brief description of 
the original algorithm must be given. For a detailed derivation of the decision 
function used in the original algorithm, see [2], 

Consistent with the general characteristics of imaging systems for remote 
sensing, we assume a two-dimensional array of N=N random observations 
Xij having fixed but unknown classifications i5y , : as shown in Figure 1. The obser- 
vation Xij consists' of n measurements .(usually- containing spectral and/or tem- 
poral information), while the classification can be any one of m spectral or 
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information classes from the set Cl = 


#11 

#12 1 ' 

# 1*8 

021 

#22 ’ ‘ 

’ # 2*3 

#*, 1 


#* 1*8 


Figure 1. A two-dimensional array of pixels. 


Let X denote a vector whose components are the ordered observations: 

X = \Xij | i = 1,2, = 

Similarly, let -0 be the vector of states (true classifications) associated with the 
observations in X: 

Let the action (classification) taken with respect to pixel (i,j) be denoted by 
oyefl. We restrict the decision function ay(-) to depend only on an arbitrary but 
fixed subset of p observations in X. This subset includes, along with Xy, p-1 
observations spatially near to, but not necessarily adjacent to, Xy. These p-1 
observations serve as the spatial context for and are taken from the same 
spatial positions relative to pixel position (i.j) for all i and j. Call this arrange- 
ment of pixels together with Xy the p-context array, several examples of which 
are shown in Figure 2. Group the p observations in the p-context array into a 
vector of observations Xy and let be the vector of true but unknown 
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classifications associated with the observations in Xq. Let "tiP eQP stand for p- 
vectbrs; of classes. Each component 6f IP is a variable which can take on any 
classification value. Note that is the particular instance, of ^ associated with 
pixel position (i,j). Correspondence of the components of i?^ t Xy and OP to the 
positions in> the p-context array is. fixed but arbitrary- except that the pixel to be 
classified will always correspond to the p 01 component. 



a p=3 choice 


a p=5 choice 

Figure 2. Examples of p-context arrays. 

Our optimal decision rule now has the form- 

o ij (X) = d(X ij ) (1) 

for a fixed function d(') mapping p-vectors of observations to actions. In deriv- 
ing the explicit decision function we assume that the distribution of X is such 
that every Xy for which ^=‘0% has the same marginal density, i.e. the marginal 
densities depend only on the measurement values in X%j and the set of 
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classifications in and not the location (i,j). Under this assumption the margi- 
nal density becomes 

= )• ( 2 ) 

Utilizing a "0-1 loss function", the decision rule becomes: 

d(Xy) — the action a which maximizes 

2 G{f)f{X v \<P) (3) 

tfPeOP. — “ 

V B 

where 0 (i5 p ), the context distribution, is the relative frequency with which iP 
occurs in the array 

One way to satisfy the assumption resulting in the relation expressed by 
equation (2) is to assume class-conditional independence for X. In this case, the 
marginal density becomes 

ft/WK) (4) 

~ “ k = \ 

where X \ and are the k th elements of Xy and IP , respectively. There may be 

other densities for X with the necessary property, but it is not apparent how one 
could construct a useful density without making possibly inconsistent assump- 
tions. Invoking the class-conditional independence assumption, the decision 
rule in equation (3) becomes: 


( 5 ) 


d{Xij) — the action a which maximizes 

2 c{f)fif(x k \* k ). 

■cPcnp. k=i 
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The optimal choice of d(-) cannot be implemented in practice since it 
depends on C(i? p ) and the f (X k | t?* ) which are unknown. Methods for estimating 
the f (X k are well established from considerable experience in using the 
conventional no-context maximum likelihood decision rule[l]. Methods for 
estimating G('0 P ) from the Xij and the effectiveness of these estimations are dis- 
cussed in the earlier reports[2,3], and are the subject of ongoing research. 

III. APPROXIMATE ALGORITHM 


To come -up yqth a reasonable approximate algorithm, one must examine 
the computer implementation of the original decision function*. Consider the 
case where the set 0 is defined over spectral classes and the class-conditional 


independence assumption is taken. For this case it is reasonable to assume the 
densities f (X k |iV) in equation (5) to be multivariate normal with mean vector 
M$ k and covariance matrix giving 


f(Xk\* k ) = 


2tt 


Stfj" 2 exp[-%(X k -M<, t ) T Z^(X k -M^)] 


where n is the dimensionality of the observation X k (see [l] for the rationale 
behind this assumption in the no-context case). Using the multivariate normal 
assumption, equation (5) becomes 


d(Xxj ) = the action a which maximizes d a (Xij) 


where 


* For this study, the algorithm was iipplemented on a PDP-11/45 computer in the program- 
ming language M C". Test runs were also made on a PDP- 11/70. computer. 
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*.(*„) = S 

~ U PtOP. k=1 


2rc 


SB. 

2 


S*» I -M,J r Si t '(X„ 




£3. 


Let diiXij) = ln[d tt (Jfy)x(2tt) 2 ]. Maximizing ) is equivalent to max- 

imizing d a (Xij). Letting Q* t (X k ) = (X k -1 \f^ k ) T, L^(X k -M^ k ), we have 


d*(X i3 ) = In 


E C(^)fHE„ t r 2 exp[-^(^)] 

<tCP. ~ *»i 


iSPeOP. 

R>„=a 


= In 


E ex P 


InC (f )-» JjJn | 2„ t | +Q« t (X k )] 




= In 


E exp 


iTSPenP. 

P,-a 


where 


F(X iJt f) * InC Of* >-^£1^ 1 ] )] 

In the simulated and real data sets studied (see the earlier reports[2,3]), 
the term exp[.F(Jf<j.iJ ,> )] ranges over a larger negative exponential range than 
available on the PDP-11/45 (an exponential range of 10 ±37 is available). To cir- 
cumvent this problem it was necessary to use the following procedure. 

Let 


M a (Xij ) 1 ma xF(X v ,&) 

— ~ UPeOP. — — 

V a 
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and rewrite d' a {Xq ) as follows: 


<C*«f) = In 


expJX(* y )] £ exp 


KSPcfiP, 

V a 


- M a (Xij ) + In 


£ exp [FiXv.WyMaiXv)] 


MPetP’, 

frp=a 


( 6 ) 


Calculating da (Ay) in this way insures that at least one term of the sum does not. 
cause underflow because the exponent pf the maximum term, M a (Xy), is never 
taken. This procedure also makes it less likely that other terms in the sum will 
cause underflow (the F(Xij>'d p ) tend to be large negative numbers). 

In checking out this particular implementation of the decision function, it 
was noted that M a (Xy) was in most cases significantly larger than the loga- 
rithmic term in equation (6). This observation suggested the following approxi- 
mation of the decision function first given in equation (5): 

d(Xjj) 5 ? the action a which maximizes AfaC^y)* (7a) 

or in the notation of equation (5): 

dL(Xij) - the action a which maximizes for all with Hip -a 


G(iP)flf(X k \‘d k ). (7b) 

— k . v 

Comparing equations (6) and (7a) one can see that the implementation of 
equation (7a) requires less computation and storage than equation (6). In equa- 
tion (7a), the logarithmic term in equation (6) need not be calculated and the 
individual values pf F{Xyj,^ ) for a particular action a need not be stored; only 
the maximum value is needed. We would expect, then, that this approximate 
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algorithm will take less computation time than the original algorithm for any 
data set. The effect of the approximation on classification accuracy, however, 
may be data dependent. 


IV. EXPERIMENTAL RESULTS 

It now remains to be tested empirically whether the lower computational 
and storage requirements of the approximate algorithm result in a significant 
savings in computer costs when compared to the original algorithm, and 
whether the classifications produced by the approximate algorithm differ 
significantly from the classifications produced by the original algorithm. 

The approximate algorithm was compared for accuracy with the original 
algorithm in tests using the simulated data set and the real data sets described 
in [2]. (The real data sets will be subseqiientially referred to as the LACIE and 
Bloomington data sets.) Included in the comparisons were algorithms that take 
only the three or five maximum terms in the summation in equation (5). These 
additional algorithms serve to give an indication of how many terms in the sum- 
mation are needed to produce classifications equivalent to those produced by 
the original algorithm. The results of this study are summarized in Table 1. The 
context distribution for the simulated data set test was estimated by tabulation 
from the template classification from which the simulated data set was gen- 
erated and the context distribution for the LACIE data set was tabulated from 
the first 25 lines of a ground-truth-guided no-context classification* as described 

in [2]. Both data sets were evaluated over the entire 50-pixel square area. The 

• A ground-truth guided classification is performed just like the usual no-context 
classification except that the classifier is restricted to selecting spectral classes from the in- 
formation class indicated by the ground truth data. 




context distribution for the Bloomington data set was tabulated from entire 50- 
pixel square area of a ground- truth-guided no-context classification. Since the 
Bloomington data set has only 1317 ground-truth pixels', the ground-truth- 
guided classification was allowed to degenerate to the usual unguided no-context 
classification over the remaining il83 pixels. The Bloomington data set was 
evaluated over the 1317 ground-truth pixels. Eight-nearest-neighbor context 
was used in all 'cases.’ 
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set as described in [3]. Also displayed are the accuracy results from applying 
the Power Method in a similar manner to the LACIE data set. 


Table 2 

PERFORMANCE OF APPROXIMATION ALGORITHM IN TERMS OF ACCURACY 
Context distribution estimated using Power Method. 

Data Set 

Overall Accuracy, % 

Original Algorithm, Approximate Algorithm, 

Equation (5) Equation (7a&b) 

Bloomington 

LACIE 

88.46 88.38 

86.70 86.66 


Again the approximate algorithm produced overall accuracies that were 
very close to those produced by the original algorithm. To put these minor 
accuracy differences in proper perspective, it helps to note that a conventional 
uniform-priors no-context classifier produced overall accuracies of 83.07% on 
the Bloomington data set and 78.73% on the LACIE data set. 

The approximate algorithm was compared in terms of computer timings 
with the original algorithm on the simulated data set and the two real Landsat 
data sets. Highly optimized versions of each algorithm (written in the "C" pro- 
gramming language) were run on PDP-11/45 and PDP-11/70 computers. Also 
compared to these two algorithms was a highly optimized version of the original 
algorithm that simply allowed the underflows to occur rather than attempting to 
circumvent the underflows. This version allows comparison of the approximate 
algorithm to a simulated implementation of the original algorithm on a com- 
puter with adequate exponential range. 
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Table 3 


PERFORMANCE OF APPROXIMATION ALGORITHM IN TERMS OF TIMINGS 
(50-pixel square LACIE data set, two-nearest-neighbpr context, 
480 nonzero elements in context distribution, PDP-ll/45 computer) 


Classifier 


Time in Seconds 


Real+ 


User+ 


Original Algorithm 
with underflow protection 

Original Algorithm 
without underflow protection 

Approximation Algorithm 


2993 

2498 

1247 


2636 

2388 

1185 


The length of time the classifier took to process the 50-pixel square data 
sets varied depending primarily on the number of nonzero elements of the con- 
text distribution. (The number of terms that need to be evaluated in the sum in 
equation (5) and the number of terms to be compared in the maximization of 
equation (7b) are equal to the number of nonzero elements in the context distri- 
bution.) The ratio of timings between the three programs remained fairly con- 
sistent, however, across all data sets. Tables 3 and 4 display typical quiet sys- 
tem* timings on a PDP-11/45 computer for cases of few nonzero elements of the 
context distribution (480) and relatively large number of nonzero elements 
(2193). Table 5 gives the timings for the case displayed in Table 4, but run on a 
PDP-1 1/70 computer. 

+ Real time is the time the program is running in the computer including the time the pro- 
gram is swapped out for other tasks. User time is essentially time spent doing computa- 
tions. 

• The runs were made during early morning hours when few other tasks were being per- 
formed by the computer. 







- 13- 


Table 4 

PERFORMANCE OF APPROXIMATION ALGORITHM IN TERMS OF TIMINGS 
(50-pixel square simulated data set, two-nearest-neighbor context, 
2193 nonzero elements in context distribution, PDP-11/45 computer) 

Classifier 

Time in Seconds 

Real 

User 

Original Algorithm 
with underflow protection 

18596 

14702 

Original Algorithm 
without underflow protection 

15064 

14290 

Approximation Algorithm 

9079 

8675 


Table 5 

PERFORMANCE OF APPROXIMATION ALGORITHM IN TERMS OF TIMINGS 
(50-pixel square simulated data set, two-nearest-neighbor context, 
2193 nonzero elements in context distribution, PDP-11/70 computer) 

Classifier 

Time in Seconds 

Real . 

User 

Original Algorithm 
with underflow protection 

7240 

5832 

Original Algorithm 
without underflow protection 

6830 

6573 

Approximation Algorithm 

2747 

2526 


The three tables show that the approximate algorithm averaged less than 
half the real and user time taken by either of the other two algorithms. This 
amounts to a significant improvement in computation time. 












- 14 - 


V. CONCLUDING REMARKS 

The contextual classification algorithm developed in [2] is very computa- 
tionally intensive, typically requiring a large amount of computer time. An ap- 
proximation to this algorithm has been explored in this report. Experimented 
results from one simulated and two real data sets show that on these data sets 
the approximate algorithm takes significantly less computer time while produc- 
ing classifications that do not differ significantly in accuracy from classifications 
produced by the original algorithm. 

By the nature of the approximate algorithm, it is expected that similar time 
savings will occur when the approximate algorithm is used on other data sets. 
Whether or not the accuracy results presented here can be expected with other 
data sets depends on the extent to which the data sets tested here are 
representative of remotely sensed data in general. We expect that they are fair- 
ly representative. Further tests are planned to. confirm that the approximation 
does not significantly affect classification accuracy. 
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